AITopics

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Information Technology (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neuhof, Bitya, Benjamini, Yuval

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

arXiv.org Machine LearningJun-9-2026

Pretrained models are often evaluated on multi-task leaderboards to measure their applicability in diverse contexts. However, current methods for aggregating performance across tasks into leaderboard-level rankings do not address the uncertainty and variability at the task level. While recent works have proposed interval-based model rankings, the principled aggregation of uncertainty from individual tasks to leaderboard-level rankings remains unaddressed, and variation in models' performance across tasks is frequently obscured. In this work, we introduce a hierarchical framework that constructs model rank intervals with statistical guarantees at both levels: task-level rank confidence intervals from pairwise comparisons, and leaderboard-level rank prediction intervals using a conformal approach. This enables reliable quantification of model rank for each observed task and for new potential tasks. Experiments on simulated data and the TabArena and PromptEval (MMLU) benchmarks show that our method yields statistically valid and informative intervals, enabling reliable, uncertainty-aware model ranking on leaderboards.

large language model, machine learning, rank cis, (18 more...)

arXiv.org Machine Learning

2606.08679

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing SystemsFeb-10-2026, 20:14:01 GMT

3ecca655ac67685fdc2155da0eceda6b-Paper-Datasets_and_Benchmarks.pdf

channel-adaptive model, dataset, generalization, (16 more...)

Country:

North America > United States > Wisconsin > Dane County > Madison (0.14)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(4 more...)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

arXiv.org Artificial IntelligenceDec-12-2025

Emergent Collective Memory in Decentralized Multi-Agent AI Systems

Khushiyant, null

We demonstrate how collective memory emerges in decentralized multi-agent systems through the interplay between individual agent memory and environmental trace communication. Our agents maintain internal memory states while depositing persistent environmental traces, creating a spatially distributed collective memory without centralized control. Comprehensive validation across five environmental conditions (20x20 to 50x50 grids, 5-20 agents, 50 runs per configuration) reveals a critical asymmetry: individual memory alone provides 68.7% performance improvement over no-memory baselines (1563.87 vs 927.23, p < 0.001), while environmental traces without memory fail completely. This demonstrates that memory functions independently but traces require cognitive infrastructure for interpretation. Systematic density-sweep experiments (rho in [0.049, 0.300], up to 625 agents) validate our theoretical phase transition prediction. On realistic large grids (30x30, 50x50), stigmergic coordination dominates above rho ~ 0.20, with traces outperforming memory by 36-41% on composite metrics despite lower food efficiency. The experimental crossover confirms the predicted critical density rho_c = 0.230 within 13% error.

agent, artificial intelligence, coordination, (16 more...)

2512.10166

Country: Europe (0.28)

Genre: Research Report > Experimental Study (0.89)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Sharmin, Shayla, Barmaki, Roghayeh Leila

Hybrid Deep Learning Model to Estimate Cognitive Effort from fNIRS Signals

arXiv.org Artificial IntelligenceOct-29-2025

This study estimates cognitive effort based on functional near-infrared spectroscopy data and performance scores using a hybrid DeepNet model. The estimation of cognitive effort enables educators to modify material to enhance learning effectiveness and student engagement. In this study, we collected oxygenated hemoglobin using functional near-infrared spectroscopy during an educational quiz game. Participants (n=16) responded to 16 questions in a Unity-based educational game, each within a 30-second response time limit. We used DeepNet models to predict the performance score from the oxygenated hemoglobin, and compared traditional machine learning and DeepNet models to determine which approach provides better accuracy in predicting performance scores. The result shows that the proposed CNN-GRU gives better performance with 73% than other models. After the prediction, we used the predicted score and the oxygenated hemoglobin to observe cognitive effort by calculating relative neural efficiency and involvement in our test cases. Our result shows that even with moderate accuracy, the predicted cognitive effort closely follow the actual trends. This findings can be helpful in designing and improving learning environments and provide valuable insights into learning materials.

artificial intelligence, cognitive effort, machine learning, (17 more...)

doi: 10.1145/3747327.3764901

2504.13883

Country: North America > United States > Delaware > New Castle County > Newark (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)
Education > Educational Technology > Educational Software > Computer Based Training (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Willemsen, Floris-Jan, van Nieuwpoort, Rob V., van Werkhoven, Ben

Tuning the Tuner: Introducing Hyperparameter Optimization for Auto-Tuning

arXiv.org Artificial IntelligenceOct-10-2025

Abstract--Automatic performance tuning (auto-tuning) is widely used to optimize performance-critical applications across many scientific domains by finding the best program variant among many choices. Efficient optimization algorithms are crucial for navigating the vast and complex search spaces in auto-tuning. As is well known in the context of machine learning and similar fields, hyperparameters critically shape optimization algorithm efficiency. Y et for auto-tuning frameworks, these hyperparameters are almost never tuned, and their potential performance impact has not been studied. We present a novel method for general hyperparameter tuning of optimization algorithms for auto-tuning, thus "tuning the tuner". In particular, we propose a robust statistical method for evaluating hyperparameter performance across search spaces, publish a F AIR data set and software for reproducibility, and present a simulation mode that replays previously recorded tuning data, lowering the costs of hyperparameter tuning by two orders of magnitude. We show that even limited hyperparam-eter tuning can improve auto-tuner performance by 94.8% on average, and establish that the hyperparameters themselves can be optimized efficiently with meta-strategies (with an average improvement of 204.7%), demonstrating the often overlooked hyperparameter tuning as a powerful technique for advancing auto-tuning research and practice. UTOMA TIC performance tuning, or auto-tuning, is a widely established method for optimizing the performance of applications in many scientific domains, including radio astronomy [1]-[4], image processing [5]-[7], fluid dynamics [8]-[10], and climate modeling [11]-[13]. Auto-tuning automates the process of exploring the myriad of implementation choices that arise in performance optimization, such as the number of threads, tile sizes used in loop blocking, and other code optimization parameters [14]. At the heart of the auto-tuning method is a search space of functionally-equivalent code variants that is explored by an optimization algorithm.

artificial intelligence, hyperparameter, optimization problem, (15 more...)

doi: 10.1109/eScience65000.2025.00033

2509.263

Country: Europe > Netherlands (0.29)

Genre:

Overview (0.93)
Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)

Neural Information Processing SystemsOct-8-2025, 12:49:38 GMT

CHAMMI: A benchmark for channel-adaptive models in microscopy imaging

However, there are many settings where the number of channels may vary, such as microscopy images where the number of channels changes depending on instruments and experimental goals.

chammi, channel-adaptive model, dataset, (16 more...)

Country:

North America > United States > Wisconsin > Dane County > Madison (0.14)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(4 more...)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Tesic, Marko, Zhao, Yue, Leibo, Joel Z., Trivedi, Rakshit S., Hernandez-Orallo, Jose

Beyond the high score: Prosocial ability profiles of multi-agent populations

arXiv.org Artificial IntelligenceSep-19-2025

The development and evaluation of social capabilities in AI agents require complex environments where competitive and cooperative behaviours naturally emerge. While game-theoretic properties can explain why certain teams or agent populations outperform others, more abstract behaviours, such as convention following, are harder to control in training and evaluation settings. The Melting Pot contest is a social AI evaluation suite designed to assess the cooperation capabilities of AI systems. In this paper, we apply a Bayesian approach known as Measurement Layouts to infer the capability profiles of multi-agent systems in the Melting Pot contest. We show that these capability profiles not only predict future performance within the Melting Pot suite but also reveal the underlying prosocial abilities of agents. Our analysis indicates that while higher prosocial capabilities sometimes correlate with better performance, this is not a universal trend-some lower-scoring agents exhibit stronger cooperation abilities. Furthermore, we find that top-performing contest submissions are more likely to achieve high scores in scenarios where prosocial capabilities are not required. These findings, together with reports that the contest winner used a hard-coded solution tailored to specific environments, suggest that at least one top-performing team may have optimised for conditions where cooperation was not necessary, potentially exploiting limitations in the evaluation framework. We provide recommendations for improving the annotation of cooperation demands and propose future research directions to account for biases introduced by different testing environments. Our results demonstrate that Measurement Layouts offer both strong predictive accuracy and actionable insights, contributing to a more transparent and generalisable approach to evaluating AI systems in complex social settings.

artificial intelligence, bayesian inference, machine learning, (17 more...)

2509.14485

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)

Neural Information Processing SystemsAug-15-2025, 08:47:50 GMT

What Matters for Adversarial Imitation Learning?

Adversarial imitation learning has become a popular framework for imitation in continuous control.

configuration, percentile, performance score, (13 more...)

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial IntelligenceAug-14-2025

Motif 2.6B Technical Report

Lim, Junghwan, Lee, Sungmin, Kim, Dongseok, Park, Eunhwan, Park, Hyunbyung, Lee, Junhyeok, Cheung, Wai Ting, Choi, Dahye, Her, Jaeheui, Huh, Jaeyeon, Jung, Hanbin, Kang, Changjin, Kim, Beomgyu, Kim, Jihwan, Kim, Minjae, Kim, Taehwan, Kim, Youngrok, Lee, Haesol, Lee, Jeesoo, Lee, Kungyu, Oh, Dongpin, Park, Yeongjae, Ryu, Bokki, Suh, Daewon, Weon, Dongjoo

Recent advancements in Large Language Models (LLMs) have revolutionized artificial intelligence, yet developing an effective foundational LLM that balances high performance with computational efficiency remains challenging, especially for emerging research groups. To address this gap, we introduce Motif-2.6B, a 2.6-billion-parameter foundation model designed to democratize advanced LLM capabilities. Motif-2.6B incorporates several innovative architectural enhancements, including Differential Attention and PolyNorm activation functions, which improve long-context comprehension, reduce hallucination, and enhance in-context learning capabilities. We rigorously tested multiple novel architectural components through extensive experimentation to determine the optimal architecture for Motif-2.6B. Comprehensive evaluations demonstrate that Motif-2.6B consistently meets or exceeds the performance of similarly sized state-of-the-art models across diverse benchmarks, showcasing its effectiveness, scalability, and real-world applicability. Through detailed experiments and tailored techniques, Motif-2.6B significantly advances the landscape of efficient, scalable, and powerful foundational LLMs, offering valuable insights and a robust foundation for future research and deployment.

large language model, machine learning, natural language, (18 more...)

2508.09148

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)